Instructions and Expectations

Background

The presidential election in 2012 did not come as a surprise. Some, including Nate Silver, correctly predicted the outcome, and many speculated about his approach.

Despite the success in 2012, the 2016 presidential election came as a big surprise to many, and it was a clear example that even the current state-of-the-art technology can surprise us. Predicting voter behavior is complicated for many reasons despite the tremendous effort in collecting, analyzing, and understanding many available datasets. For our final project, we will analyze the 2016 presidential election dataset.

Answer the following questions in one paragraph for each.

  1. What makes voter behavior prediction (and thus election forecasting) a hard problem?

  2. What was unique to Nate Silver’s approach in 2012 that allowed him to achieve good predictions?

  3. What went wrong in 2016? What do you think should be done to make future predictions better?

Data

election.raw = read.csv("data/election/election.csv") %>% as.tbl
census_meta = read.csv("data/census/metadata.csv", sep = ";") %>% as.tbl
census = read.csv("data/census/census.csv") %>% as.tbl

Election data

The meaning of each column in election.raw is clear except fips, an acronym for Federal Information Processing Standard.

In our dataset, fips values denote the area (US, state, or county) that each row of data represents. For example, a fips value of 6037 denotes Los Angeles County.

county fips candidate state votes
Los Angeles County 6037 Hillary Clinton CA 2464364
Los Angeles County 6037 Donald Trump CA 769743
Los Angeles County 6037 Gary Johnson CA 88968
Los Angeles County 6037 Jill Stein CA 76465
Los Angeles County 6037 Gloria La Riva CA 21993

Some rows in election.raw are summary rows; these rows have a county value of NA. There are two kinds of summary rows:

  • Federal-level summary rows have a fips value of US.
  • State-level summary rows have the respective state name as the fips value.

  1. Report the dimensions of election.raw after removing rows with fips=2000, and provide a reason for excluding them. Please make sure to use the same name election.raw before and after removing those observations.
election.raw <- election.raw %>% 
  filter(fips != "2000")

New dimensions: 18345 rows × 5 columns

Alaska has a fips value of 2000, so the rows with fips=2000 are indeed state-level summaries of Alaska's election results. However, Alaska's state-level summary rows are already available when we read the data, so keeping the fips=2000 rows would only duplicate records.
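One way to support this claim is to check, on the unfiltered data (i.e., before the filter applied above), whether the fips == "2000" rows carry the same vote totals as Alaska's state-level summary rows. A sketch, assuming state-level summaries use the abbreviation "AK" as their fips value:

```r
# Sketch: re-read the raw file and compare the fips == "2000" rows against
# Alaska's state-level summary rows (fips == "AK").
library(dplyr)
raw <- read.csv("data/election/election.csv") %>% as.tbl

raw %>%
  filter(fips %in% c("2000", "AK")) %>%
  group_by(candidate) %>%
  summarise(distinct_totals = n_distinct(votes))
# If every candidate shows a single distinct vote total, the fips == "2000"
# rows are exact duplicates of the "AK" summary rows.
```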

Census data

The following are the first few rows of the census data:

State County TotalPop Men Women Hispanic White Black Native Asian Pacific Citizen Income IncomeErr IncomePerCap IncomePerCapErr Poverty ChildPoverty Professional Service Office Construction Production Drive Carpool Transit Walk OtherTransp WorkAtHome MeanCommute Employed PrivateWork PublicWork SelfEmployed FamilyWork Unemployment
Alabama Autauga 1948 940 1008 0.9 87.4 7.7 0.3 0.6 0.0 1503 61838 11900 25713 4548 8.1 8.4 34.7 17.0 21.3 11.9 15.2 90.2 4.8 0 0.5 2.3 2.1 25.0 943 77.1 18.3 4.6 0 5.4
Alabama Autauga 2156 1059 1097 0.8 40.4 53.3 0.0 2.3 0.0 1662 32303 13538 18021 2474 25.5 40.3 22.3 24.7 21.5 9.4 22.0 86.3 13.1 0 0.0 0.7 0.0 23.4 753 77.0 16.9 6.1 0 13.3
Alabama Autauga 2968 1364 1604 0.0 74.5 18.6 0.5 1.4 0.3 2335 44922 5629 20689 2817 12.7 19.7 31.4 24.9 22.1 9.2 12.4 94.8 2.8 0 0.0 0.0 2.5 19.6 1373 64.1 23.6 12.3 0 6.2
Alabama Autauga 4423 2172 2251 10.5 82.8 3.7 1.6 0.0 0.0 3306 54329 7003 24125 2870 2.1 1.6 27.0 20.8 27.0 8.7 16.4 86.6 9.1 0 0.0 2.6 1.6 25.3 1782 75.7 21.2 3.1 0 10.8
Alabama Autauga 10763 4922 5841 0.7 68.5 24.8 0.0 3.8 0.0 7666 51965 6935 27526 2813 11.4 17.5 49.6 14.2 18.2 2.1 15.8 88.0 10.5 0 0.0 0.6 0.9 24.8 5037 67.1 27.6 5.3 0 4.2
Alabama Autauga 3851 1787 2064 13.1 72.9 11.9 0.0 0.0 0.0 2642 63092 9585 30480 7550 14.4 21.9 24.2 17.5 35.4 7.9 14.9 82.7 6.9 0 0.0 6.0 4.5 19.8 1560 79.4 14.7 5.8 0 10.9

Census data: column metadata

Column information is given in the metadata file.

Data wrangling

  1. Move summary rows from election.raw into federal- or state-level summary datasets: i.e.,

    • Federal-level summary rows into election_federal.

    • State-level summary rows into election_state.

    • Only county-level data remains in election.

election_federal <- election.raw %>% 
  filter(fips == "US")

election_state <- election.raw %>% 
  filter(fips != "US" & is.na(county))

election <- election.raw %>% 
  filter(!is.na(county))

  1. How many named presidential candidates were there in the 2016 election? Draw a bar chart of all votes received by each candidate. You can split this into multiple plots or plot the results on a log scale. Either way, the results should be clear and legible!

There were 31 candidates in the 2016 election.

order_candidates <- election_federal[order(election_federal$votes),]
candidates_total <- length(unique(election.raw$candidate))
by_candidate <- election.raw %>% 
  group_by(candidate) %>% 
  summarise(total_votes = sum(votes))


candidates <- ggplot(election_federal, aes(x=reorder(candidate, -votes), y=votes)) + 
  geom_bar(stat="identity", fill = "#56B4E9") + 
  labs(title='Candidate National Votes', x = 'Candidate Name', y = 'Vote Count') +
  theme(axis.text.x= element_text(angle=-75, size = 8, hjust = .2), panel.grid.major = element_blank(), panel.grid.minor = element_blank(),panel.background = element_blank(), axis.line = element_line(colour = "black"), plot.title = element_text(hjust = 0.5)) 

candidates

  1. Create variables county_winner and state_winner by taking the candidate with the highest proportion of votes. Hint: to create county_winner, start with election, group by fips, compute total votes, and pct = votes/total. Then choose the highest row using top_n (variable state_winner is similar).
county_winner <- election %>% 
  group_by(fips) %>% 
  mutate(total = sum(votes)) %>% 
  mutate(pct = votes/total) %>% 
  top_n(n=1, pct)

state_winner <- election_state %>% 
  group_by(fips) %>% 
  mutate(total = sum(votes)) %>% 
  mutate(pct = votes/total) %>% 
  top_n(n=1, pct)

Visualization

Visualization is crucial for gaining insight and intuition during the data mining process. To that end, we will generate cartographic representations (maps) of the states and counties, and map our data onto these representations.

The R package ggplot2 can be used to draw maps. Consider the following code.

states <- map_data("state")

ggplot(data = states) + 
  geom_polygon(aes(x = long, y = lat, fill = region, group = group), color = "white") + 
  coord_fixed(1.3) +
  guides(fill=FALSE)  # color legend is unnecessary and takes too long

The variable states contains information to draw white polygons, while the fill-colors are determined by region.

  1. Draw a county-level map by creating counties = map_data("county") and color by county.
counties = map_data("county")
ggplot(data = counties) +
  geom_polygon(aes(x = long, y = lat, fill = region, group = group), color = "white") + 
  coord_fixed(1.3) + 
  guides(fill=FALSE)

  1. Now color the map by the winning candidate for each state. First, use left_join() to combine the states variable and the state_winner variable we created earlier. Note that left_join() needs to match up values of states to join the tables. A call to left_join() takes all the values from the first table and looks for matches in the second table. For each match, left_join() appends the data from the second table to the matching row in the first; if no matching value is found, it adds missing values:

    Here, we’ll be combining the two datasets based on state name. However, the state names are in different formats in the two tables: e.g. AZ vs. arizona. Before using left_join(), create a common column by creating a new column for states named fips = state.abb[match(some_column, some_function(state.name))]. Replace some_column and some_function to complete creation of this new column. Then left_join(). Your figure will look similar to this state level New York Times map.

states$fips = state.abb[match(states$region, tolower(state.name))]
state_winner$fips <- as.factor(state_winner$fips)

states <- left_join(states, state_winner, by = 'fips')
## Warning: Column `fips` joining character vector and factor, coercing into
## character vector
ggplot(data = states) +
  geom_polygon(aes(x = long, y = lat, fill = candidate, group = group), color = "white") + 
  coord_fixed(1.3) + 
  guides(fill=FALSE)

  1. The counties data frame does not have a fips column, so we will create one by pooling information from maps::county.fips. Split the polyname column into region and subregion. Use left_join() to combine county.fips with counties, then left_join() the previously created county_winner. Your figure will look similar to the county-level New York Times map.
county.fips <- maps::county.fips
county.fips$fips <- as.factor(county.fips$fips)

county.fips <- county.fips %>%  
  separate(polyname, c('region', 'subregion'), sep = ",") 

county.fips = left_join(county.fips, county_winner, by= 'fips')
## Warning: Column `fips` joining factors with different levels, coercing to
## character vector
counties = left_join(counties, county.fips, by=c('region', 'subregion'))

ggplot(data = counties) + 
  geom_polygon(aes(x = long, y = lat, fill = candidate, group = group), color = "white", size = .1) + 
  coord_fixed(1.3) +
  guides(fill=FALSE)

  1. Create a visualization of your choice using census data. Many exit polls noted that demographics played a big role in the election. Use this Washington Post article and this R graph gallery for ideas and inspiration.

  2. The census data contains high-resolution information (more fine-grained than county-level).
    In this problem, we aggregate the information into county-level data by computing the TotalPop-weighted average of each attribute for each county. Create the following variables:

    • Clean census data, census.del: start with census, filter out any rows with missing values, convert the {Men, Employed, Citizen} attributes to percentages (the metadata seems to be inaccurate), compute a Minority attribute by combining {Hispanic, Black, Native, Asian, Pacific}, remove those variables after creating Minority, and remove {Walk, PublicWork, Construction}.
      Many columns are related, and when a set of them adds up to 100%, one column can be deleted: e.g., Men and Women together comprise TotalPop, so knowing the total and one of the two determines the other, and we delete one of them.
census.del = census[complete.cases(census),] %>% 
  mutate(Men = Men / TotalPop, Employed = Employed / TotalPop, Citizen = Citizen / TotalPop) %>% 
  mutate(Minority = Hispanic + Black + Native + Asian + Pacific) %>% 
  select(-Walk, -PublicWork, -Construction, -Hispanic, -Black, -Native, -Asian, -Pacific, -Women)
    • Sub-county census data, census.subct: start with census.del from above, group_by() the two attributes {State, County}, and use add_tally() to compute CountyTotal. Also compute the weight as TotalPop/CountyTotal.
census.subct = census.del %>% 
  group_by(State, County) %>% 
  add_tally(TotalPop) %>%                  # county population total
  rename(CountyTotal = n) %>% 
  mutate(Weight = TotalPop / CountyTotal) %>% 
  # scale each attribute by its sub-county weight
  mutate_at(vars(Men:Citizen, Poverty:Minority), ~ . * Weight)
    • County census data, census.ct: start with census.subct and use summarize_at() to compute the weighted sum.
census.ct = census.subct %>% 
  group_by(State, County) %>% 
  summarize_at(vars(Men:Minority), sum)
    • Print the first few rows of census.ct:
head(census.ct)
## # A tibble: 6 x 27
## # Groups:   State [1]
##   State County   Men White Citizen Income IncomeErr IncomePerCap
##   <fct> <fct>  <dbl> <dbl>   <dbl>  <int>     <int>        <int>
## 1 Alab… Autau… 0.484  75.8   0.737 6.00e5     96427       292643
## 2 Alab… Baldw… 0.488  83.1   0.757 1.51e6    287310       832126
## 3 Alab… Barbo… 0.538  46.2   0.769 2.91e5     53023       153941
## 4 Alab… Bibb   0.534  74.5   0.774 1.61e5     24571        75228
## 5 Alab… Blount 0.494  87.9   0.734 4.06e5     80281       181542
## 6 Alab… Bullo… 0.530  22.2   0.755 1.00e5     25534        53205
## # … with 19 more variables: IncomePerCapErr <int>, Poverty <dbl>,
## #   ChildPoverty <dbl>, Professional <dbl>, Service <dbl>, Office <dbl>,
## #   Production <dbl>, Drive <dbl>, Carpool <dbl>, Transit <dbl>,
## #   OtherTransp <dbl>, WorkAtHome <dbl>, MeanCommute <dbl>,
## #   Employed <dbl>, PrivateWork <dbl>, SelfEmployed <dbl>,
## #   FamilyWork <dbl>, Unemployment <dbl>, Minority <dbl>
  1. If you were physically located in the United States on election day for the 2016 presidential election, what state and county were you in? Compare and contrast these county results, demographic information, etc., against the state it is located in. If you were not in the United States on election day, select a county that appears to stand apart from the ones surrounding it. Do you find anything unusual or surprising? If not, what do you hypothesise might be the reason for the county and state results?

Dimensionality reduction

  1. Run PCA for both county- and sub-county-level data. Save the first two principal components PC1 and PC2 into a two-column data frame, called ct.pc and subct.pc, for county and sub-county respectively. Discuss whether you chose to center and scale the features before running PCA and the reasons for your choice. What are the three features with the largest absolute values of the first principal component? Which features have opposite signs and what does that mean about the correlation between these features?
pr_ct = census.ct %>% 
  ungroup %>% 
  select(c(Men:Minority)) %>% 
  prcomp(scale = TRUE)

pr_subct = census.subct %>% 
  ungroup %>% 
  select(c(Men:Minority)) %>% 
  prcomp(scale = TRUE)

ct.pc = as.data.frame(pr_ct$rotation[,1:2])

ct.pc.abs = abs(ct.pc)
# order features by the magnitude of their PC1 loadings
ct.pc.order = order(ct.pc.abs$PC1, decreasing = TRUE)

subct.pc = as.data.frame(pr_subct$rotation[,1:2])
subct.pc.abs = abs(subct.pc)

kable(subct.pc[order(abs(pr_subct$rotation[,1]), decreasing=TRUE), ], col.names = c("PC1 Ordered", "PC2"), caption = "Subcounty PCA Loadings")  %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width=FALSE) %>% scroll_box(width = "100%")
Subcounty PCA Loadings
PC1 Ordered PC2
Men 0.2517602 -0.0355148
Citizen 0.2516750 -0.0371084
PrivateWork 0.2492251 -0.0250022
Drive 0.2489331 -0.0253583
Office 0.2449174 -0.0239673
Employed 0.2442508 -0.0589544
Service 0.2417955 -0.0007899
Professional 0.2410704 -0.0747356
MeanCommute 0.2401601 -0.0235876
White 0.2344204 -0.0658673
Carpool 0.2314233 -0.0169159
Production 0.2280590 0.0096322
Poverty 0.2251172 0.0544654
ChildPoverty 0.2178755 0.0547904
SelfEmployed 0.2152096 -0.0820949
Unemployment 0.2057795 0.0562520
WorkAtHome 0.1886831 -0.0985581
Minority 0.1651724 0.0587213
OtherTransp 0.1294814 -0.0088174
FamilyWork 0.1184700 -0.0668117
Transit 0.0688494 -0.0362329
Income -0.0539271 -0.4942202
IncomeErr -0.0502902 -0.4396261
IncomePerCap -0.0487308 -0.5435088
IncomePerCapErr -0.0416104 -0.4619181
kable(subct.pc[order(abs(pr_subct$rotation[,2]), decreasing=TRUE), ], col.names = c("PC1", "PC2 Ordered"), caption = "Subcounty PCA Loadings")  %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width=FALSE) %>% scroll_box(width = "100%")
Subcounty PCA Loadings
PC1 PC2 Ordered
IncomePerCap -0.0487308 -0.5435088
Income -0.0539271 -0.4942202
IncomePerCapErr -0.0416104 -0.4619181
IncomeErr -0.0502902 -0.4396261
WorkAtHome 0.1886831 -0.0985581
SelfEmployed 0.2152096 -0.0820949
Professional 0.2410704 -0.0747356
FamilyWork 0.1184700 -0.0668117
White 0.2344204 -0.0658673
Employed 0.2442508 -0.0589544
Minority 0.1651724 0.0587213
Unemployment 0.2057795 0.0562520
ChildPoverty 0.2178755 0.0547904
Poverty 0.2251172 0.0544654
Citizen 0.2516750 -0.0371084
Transit 0.0688494 -0.0362329
Men 0.2517602 -0.0355148
Drive 0.2489331 -0.0253583
PrivateWork 0.2492251 -0.0250022
Office 0.2449174 -0.0239673
MeanCommute 0.2401601 -0.0235876
Carpool 0.2314233 -0.0169159
Production 0.2280590 0.0096322
OtherTransp 0.1294814 -0.0088174
Service 0.2417955 -0.0007899
kable(ct.pc[order(abs(pr_ct$rotation[,1]), decreasing=TRUE), ], col.names = c("PC1 Ordered", "PC2"), caption = "County PCA Loadings")  %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width=FALSE) %>% scroll_box(width = "100%")
County PCA Loadings
PC1 Ordered PC2
Poverty 0.2251172 0.0544654
ChildPoverty 0.2178755 0.0547904
Unemployment 0.2057795 0.0562520
Employed 0.2442508 -0.0589544
Minority 0.1651724 0.0587213
White 0.2344204 -0.0658673
WorkAtHome 0.1886831 -0.0985581
Professional 0.2410704 -0.0747356
Service 0.2417955 -0.0007899
SelfEmployed 0.2152096 -0.0820949
MeanCommute 0.2401601 -0.0235876
Drive 0.2489331 -0.0253583
FamilyWork 0.1184700 -0.0668117
Production 0.2280590 0.0096322
Carpool 0.2314233 -0.0169159
Office 0.2449174 -0.0239673
Citizen 0.2516750 -0.0371084
Men 0.2517602 -0.0355148
Income -0.0539271 -0.4942202
IncomePerCap -0.0487308 -0.5435088
OtherTransp 0.1294814 -0.0088174
PrivateWork 0.2492251 -0.0250022
IncomePerCapErr -0.0416104 -0.4619181
IncomeErr -0.0502902 -0.4396261
Transit 0.0688494 -0.0362329
kable(ct.pc[order(abs(pr_ct$rotation[,2]), decreasing=TRUE), ], col.names = c("PC1", "PC2 Ordered"), caption = "County PCA Loadings")  %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width=FALSE) %>% scroll_box(width = "100%")
County PCA Loadings
PC1 PC2 Ordered
IncomePerCap -0.0487308 -0.5435088
IncomePerCapErr -0.0416104 -0.4619181
IncomeErr -0.0502902 -0.4396261
Income -0.0539271 -0.4942202
Transit 0.0688494 -0.0362329
Citizen 0.2516750 -0.0371084
Professional 0.2410704 -0.0747356
White 0.2344204 -0.0658673
Minority 0.1651724 0.0587213
Production 0.2280590 0.0096322
Office 0.2449174 -0.0239673
SelfEmployed 0.2152096 -0.0820949
Drive 0.2489331 -0.0253583
MeanCommute 0.2401601 -0.0235876
PrivateWork 0.2492251 -0.0250022
Employed 0.2442508 -0.0589544
Men 0.2517602 -0.0355148
FamilyWork 0.1184700 -0.0668117
OtherTransp 0.1294814 -0.0088174
Carpool 0.2314233 -0.0169159
ChildPoverty 0.2178755 0.0547904
Unemployment 0.2057795 0.0562520
Poverty 0.2251172 0.0544654
WorkAtHome 0.1886831 -0.0985581
Service 0.2417955 -0.0007899
  1. Determine the minimum number of PCs needed to capture 90% of the variance for both the county and sub-county analyses. Plot the proportion of variance explained (PVE) and cumulative PVE for both county and sub-county analyses.

After the 7th principal component, 90 percent of the variance is explained at the sub-county level. After the 12th principal component, 90 percent of the variance is explained at the county level.

pc_var_subct = pr_subct$sdev^2
pve_subct = pc_var_subct/sum(pc_var_subct)
cumulative_pve_subct = cumsum(pve_subct)

min(which(cumulative_pve_subct > .9))
## [1] 7
pc_var_ct = pr_ct$sdev^2
pve_ct = pc_var_ct/sum(pc_var_ct)
cumulative_pve_ct = cumsum(pve_ct)

min(which(cumulative_pve_ct > .9))
## [1] 12
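The requested PVE plots can be drawn directly from the quantities above; a minimal base-R sketch for the county level (the sub-county version is identical with pve_subct and cumulative_pve_subct):

```r
# PVE and cumulative PVE for the county-level PCA, with the 90% threshold.
par(mfrow = c(1, 2))
plot(pve_ct, type = "b", xlab = "Principal Component",
     ylab = "PVE", main = "County PVE")
plot(cumulative_pve_ct, type = "b", xlab = "Principal Component",
     ylab = "Cumulative PVE", main = "County Cumulative PVE")
abline(h = 0.9, lty = 2)  # 90% of variance
```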

Clustering

  1. With census.ct, perform hierarchical clustering with complete linkage. Cut the tree to partition the observations into 10 clusters. Re-run the hierarchical clustering algorithm using the first 5 principal components of ct.pc as inputs instead of the original features. Compare and contrast the results. For both approaches, investigate the cluster that contains San Mateo County. Which approach seemed to put San Mateo County in a more appropriate cluster? Comment on what you observe and discuss possible explanations for these observations.
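One way to carry out both clusterings, as a sketch: it assumes census.ct and pr_ct from the PCA section, and that the County column spells the county "San Mateo".

```r
# Hierarchical clustering on the scaled county features, complete linkage.
scaled.ct <- scale(census.ct %>% ungroup %>% select(Men:Minority))
hc.full <- hclust(dist(scaled.ct), method = "complete")
clusters.full <- cutree(hc.full, k = 10)

# Same procedure on the first 5 principal component scores.
hc.pc <- hclust(dist(pr_ct$x[, 1:5]), method = "complete")
clusters.pc <- cutree(hc.pc, k = 10)

# Cluster membership of San Mateo County under each approach.
sm <- which(census.ct$County == "San Mateo")
clusters.full[sm]
clusters.pc[sm]
```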

Classification

In order to train classification models, we need to combine county_winner and census.ct data. This seemingly straightforward task is harder than it sounds. The following code makes the necessary changes to merge them into election.cl for classification.

tmpwinner <- county_winner %>% ungroup %>%
  mutate(state = state.name[match(state, state.abb)]) %>%               ## abbreviations to full state names
  mutate_at(vars(state, county), tolower) %>%                           ## to all lowercase
  mutate(county = gsub(" county| columbia| city| parish", "", county))  ## remove suffixes
tmpcensus <- census.ct %>% mutate_at(vars(State, County), tolower)

election.cl <- tmpwinner %>%
  left_join(tmpcensus, by = c("state"="State", "county"="County")) %>% 
  na.omit

## save meta information
election.meta <- election.cl %>% select(c(county, fips, state, votes, pct, total))

## save predictors and class labels
election.cl = election.cl %>% select(-c(county, fips, state, votes, pct, total))

Using the following code, partition data into 80% training and 20% testing:

set.seed(10) 
n <- nrow(election.cl)
in.trn <- sample.int(n, 0.8*n) 
trn.cl <- election.cl[ in.trn,]
tst.cl <- election.cl[-in.trn,]

Using the following code, define 10 cross-validation folds:

set.seed(20) 
nfold <- 10
folds <- sample(cut(1:nrow(trn.cl), breaks=nfold, labels=FALSE))

Using the following error rate function:

calc_error_rate = function(predicted.value, true.value){
  return(mean(true.value!=predicted.value))
}
records = matrix(NA, nrow=3, ncol=2)
colnames(records) = c("train.error","test.error")
rownames(records) = c("tree","logistic","lasso")
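A quick toy check of calc_error_rate, with hypothetical prediction and truth vectors, to illustrate that it returns the fraction of mismatches:

```r
# Two of the five hypothetical predictions disagree with the truth,
# so calc_error_rate should return 2/5 = 0.4.
pred <- c("Trump", "Clinton", "Trump", "Trump", "Clinton")
true <- c("Trump", "Clinton", "Clinton", "Clinton", "Clinton")
calc_error_rate(pred, true)  # 0.4
```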
  1. Decision tree: train a decision tree with cv.tree(). Prune the resulting tree to minimize misclassification error. Be sure to use the folds from above for cross-validation. Visualize the trees before and after pruning. Save training and test errors to a records variable. Interpret and discuss the results of the decision tree analysis. Use this plot to tell a story about voting behavior in the US (remember the NYT infographic?)
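A sketch of one possible implementation (it assumes the tree package, a two-level factor response candidate in trn.cl/tst.cl, and the folds and records objects defined above):

```r
library(tree)

# Fit and visualize the unpruned tree.
tree.fit <- tree(candidate ~ ., data = trn.cl)
plot(tree.fit); text(tree.fit, pretty = 0, cex = 0.6)

# Cross-validate with the predefined folds, minimizing misclassification error.
tree.cv <- cv.tree(tree.fit, rand = folds, FUN = prune.misclass)
best.size <- min(tree.cv$size[tree.cv$dev == min(tree.cv$dev)])

# Prune to the best size and visualize again.
tree.pruned <- prune.misclass(tree.fit, best = best.size)
plot(tree.pruned); text(tree.pruned, pretty = 0, cex = 0.6)

# Record training and test error rates.
records["tree", "train.error"] <-
  calc_error_rate(predict(tree.pruned, trn.cl, type = "class"), trn.cl$candidate)
records["tree", "test.error"] <-
  calc_error_rate(predict(tree.pruned, tst.cl, type = "class"), tst.cl$candidate)
```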

  2. Run a logistic regression to predict the winning candidate in each county. Save training and test errors to the records variable. What are the significant variables? Are these consistent with what you observed in the decision tree analysis? Interpret the meaning of a couple of the significant coefficients in terms of a unit change in the variables. Did your particular county (from question 13) results match the predicted results?
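A possible sketch (it assumes candidate is a two-level factor; droplevels() it first if unused candidate levels remain):

```r
# Logistic regression on all county-level predictors.
glm.fit <- glm(candidate ~ ., data = trn.cl, family = binomial)
summary(glm.fit)  # inspect which coefficients are significant

# Classify with a 0.5 cutoff; the second factor level is the glm "success".
classify <- function(df) {
  p <- predict(glm.fit, df, type = "response")
  ifelse(p > 0.5, levels(trn.cl$candidate)[2], levels(trn.cl$candidate)[1])
}
records["logistic", "train.error"] <- calc_error_rate(classify(trn.cl), trn.cl$candidate)
records["logistic", "test.error"]  <- calc_error_rate(classify(tst.cl), tst.cl$candidate)
```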

  3. You may notice that you get a warning glm.fit: fitted probabilities numerically 0 or 1 occurred. As we discussed in class, this is an indication that we have perfect separation (some linear combination of variables perfectly predicts the winner). This is usually a sign that we are overfitting. One way to control overfitting in logistic regression is through regularization. Use the cv.glmnet function from the glmnet library to run K-fold cross validation and select the best regularization parameter for the logistic regression under the LASSO penalty. Reminder: set alpha=1 to run LASSO. What are the non-zero coefficients in the LASSO regression for the optimal value of \(\lambda\)? How do they compare to the unpenalized logistic regression? Save training and test errors to the records variable.
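One hedged sketch using glmnet; the foldid argument reuses the folds defined earlier so the CV folds match the other models:

```r
library(glmnet)

x.trn <- model.matrix(candidate ~ ., data = trn.cl)[, -1]
y.trn <- trn.cl$candidate
x.tst <- model.matrix(candidate ~ ., data = tst.cl)[, -1]

# K-fold CV over lambda with the LASSO penalty (alpha = 1).
lasso.cv <- cv.glmnet(x.trn, y.trn, family = "binomial",
                      alpha = 1, foldid = folds)
lasso.cv$lambda.min               # optimal regularization parameter
coef(lasso.cv, s = "lambda.min")  # non-zero entries are the surviving predictors

records["lasso", "train.error"] <- calc_error_rate(
  predict(lasso.cv, x.trn, s = "lambda.min", type = "class"), y.trn)
records["lasso", "test.error"] <- calc_error_rate(
  predict(lasso.cv, x.tst, s = "lambda.min", type = "class"), tst.cl$candidate)
```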

  4. Compute ROC curves for the decision tree, logistic regression and LASSO logistic regression using predictions on the test data. Display them on the same plot. Based on your classification results, discuss the pros and cons of the various methods. Are the different classifiers more appropriate for answering different kinds of questions about the election?
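A sketch with the ROCR package (it assumes the fitted objects tree.pruned, glm.fit, lasso.cv, and x.tst from the previous questions, and takes the second level of candidate as the positive class):

```r
library(ROCR)

truth <- tst.cl$candidate
roc_curve <- function(p) performance(prediction(p, truth), "tpr", "fpr")

# Predicted probabilities of the positive (second-level) class on the test set.
prob.tree  <- predict(tree.pruned, tst.cl)[, 2]
prob.glm   <- predict(glm.fit, tst.cl, type = "response")
prob.lasso <- as.numeric(predict(lasso.cv, x.tst, s = "lambda.min",
                                 type = "response"))

plot(roc_curve(prob.tree), col = "red", main = "Test-set ROC curves")
plot(roc_curve(prob.glm), col = "blue", add = TRUE)
plot(roc_curve(prob.lasso), col = "darkgreen", add = TRUE)
abline(0, 1, lty = 2)
legend("bottomright", c("tree", "logistic", "lasso"),
       col = c("red", "blue", "darkgreen"), lty = 1)
```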

Taking it further

  1. This is an open question. Interpret and discuss any overall insights gained in this analysis and possible explanations. Use any tools at your disposal to make your case: visualize errors on the map, discuss what does or doesn’t seem reasonable based on your understanding of these methods, propose possible directions (collecting additional data, domain knowledge, etc.). In addition, propose and tackle at least one more interesting question. Creative and thoughtful analyses will be rewarded! This part will be worth up to 20% of your final project grade!

Some possibilities for further exploration are: